Efficient Progressive Sampling for Association Rules

Author

  • Srinivasan Parthasarathy
Abstract

In data mining, sampling has often been suggested as an effective tool to reduce the size of the dataset operated on, at some cost to accuracy. However, this loss of accuracy is often difficult to measure and characterize, since the exact nature of the learning curve (accuracy vs. sample size) is parameter and data dependent; that is, we do not know a priori what sample size is needed to achieve a desired accuracy on a particular dataset for a particular set of parameters. In this article we propose the use of progressive sampling to determine the required sample size for association rule mining. We first show that a naive application of progressive sampling is not very efficient for association rule mining. We then present a refinement based on equivalence classes that seems to work extremely well in practice and converges to the desired sample size very quickly and very accurately. An additional novelty of our approach is the definition of a support-sensitive, interactive measure of accuracy across progressive samples.
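
To make the idea concrete, here is a minimal Python sketch of the naive form of progressive sampling that the abstract describes as not very efficient. It is not the authors' method: the geometric sample schedule, the Jaccard similarity used as the accuracy measure, the brute-force miner limited to itemsets of size two, and all function and parameter names (`frequent_itemsets`, `jaccard`, `progressive_sample_size`, `start`, `factor`, `threshold`) are assumptions made for illustration only.

```python
import random
from itertools import combinations


def frequent_itemsets(transactions, min_support, max_size=2):
    """Brute-force miner (illustrative only): return itemsets of up to
    max_size items whose relative support is at least min_support."""
    n = len(transactions)
    counts = {}
    for t in transactions:
        items = sorted(set(t))
        for k in range(1, max_size + 1):
            for combo in combinations(items, k):
                counts[combo] = counts.get(combo, 0) + 1
    return {s for s, c in counts.items() if c / n >= min_support}


def jaccard(a, b):
    """Similarity between two sets of frequent itemsets (assumed measure)."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def progressive_sample_size(transactions, min_support,
                            start=100, factor=2, threshold=0.95, seed=0):
    """Grow nested samples geometrically and stop once two consecutive
    samples yield sufficiently similar sets of frequent itemsets."""
    rng = random.Random(seed)
    shuffled = list(transactions)
    rng.shuffle(shuffled)  # prefixes of a shuffle act as nested random samples

    size = min(start, len(shuffled))
    previous = frequent_itemsets(shuffled[:size], min_support)
    while size < len(shuffled):
        next_size = min(size * factor, len(shuffled))
        current = frequent_itemsets(shuffled[:next_size], min_support)
        if jaccard(previous, current) >= threshold:
            return size  # the smaller sample already agrees with the larger one
        size, previous = next_size, current
    return size  # fell through: the full dataset was needed
```

Each step of this loop reruns the miner from scratch on the larger sample, which gives a sense of why a naive application can be costly for association rule mining.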

Similar resources

A Novel Progressive Sampling based Approach for Effective Mining of Association Rules

Mining association rules from huge databases is one of the important issues that need to be addressed. This paper presents a new sampling-based association rule mining algorithm that uses a progressive sampling approach based on the negative border and the frequent pattern growth (FP-Growth) algorithm for finding the candidate item sets, which ultimately shortens the execution time in generating ...

Progressive Sampling for Association Rules Based on Sampling Error Estimation

We explore in this paper a progressive sampling algorithm, called Sampling Error Estimation (SEE), which aims to identify an appropriate sample size for mining association rules. SEE has two advantages over previous works in the literature. First, SEE is highly efficient because an appropriate sample size can be determined without the need of executing association rules. Second, the identified ...

A new approach based on data envelopment analysis with double frontiers for ranking the discovered rules from data mining

Data envelopment analysis (DEA) is a relatively new data-oriented approach to evaluating the performance of a set of peer entities called decision-making units (DMUs) that convert multiple inputs into multiple outputs. Within a relatively limited period, DEA has been converted into a strong quantitative and analytical tool to measure and evaluate performance. In an article written by Toloo et al. (2009...

Efficient Algorithms for Discovering Association Rules

Association rules are statements of the form "for 90% of the rows of the relation, if the row has value 1 in the columns in set W, then it has 1 also in column B". Agrawal, Imielinski, and Swami introduced the problem of mining association rules from large collections of data, and gave a method based on successive passes over the database. We give an improved algorithm for the problem. The meth...

Mining Multiple-Level Association Rules in Large Databases

A top-down progressive deepening method is developed for efficient mining of multiple-level association rules from large transaction databases based on the Apriori principle. A group of variant algorithms is proposed based on the ways of sharing intermediate results, with the relative performance tested and analyzed. The enforcement of different interestingness measurements to find more intere...

Publication date: 2002